We show that the information-theoretic maximum entropy (MaxEnt) approach to deriving the canonical ensemble theory is mathematically equivalent to the classic approach of Boltzmann, Gibbs and Darwin-Fowler. The two approaches, however, "interpret" the same mathematical theorem differently; most notably, the former is conditioned on observing the mean energy, while the latter rests on energy conservation. Applying the same MaxEnt method to the grand canonical ensemble fails, however, while the correct statistics is obtained if one carefully follows the classic approach based on Boltzmann's microcanonical equal probability a priori. One does not need to invoke quantum mechanics, and there is no Gibbs paradox. MaxEnt and the related minimum relative entropy principle are based on a mathematical theorem concerning large deviations of rare fluctuations. As a scientific method, it requires classical mechanics or other assumptions to provide meaningful prior distributions for the expected-value based statistical inference. A naive assumption of a uniform prior is not valid in statistical mechanics.




There exist several frameworks, based on information theory and/or statistical inference [1-3], which have been put forward as possible alternatives to the Boltzmann-Gibbsian foundation of statistical thermodynamics [4, 5]. At the same time, there has been a growing body of "statistical mechanical" studies of systems and processes with no thermal molecular origin, ranging from signal processing to combinatorial optimization to neural networks [6]. Almost all these studies claim Shannon's information theory as their foundation [7]. These approaches are used to deduce probability distributions for fluctuating quantities which usually have no connection to Newtonian mechanics.



This paper reexamines these approaches while clarifying and contrasting the differences between the classic approach to statistical mechanics and the new ones. The principle of maximum entropy (MaxEnt), or minimum Kullback-Leibler relative entropy [8], which is at the heart of these information-based approaches, was first proposed by E.T. Jaynes as an alternative foundation for statistical mechanics [1]. On the other hand, in information theory, following the collective work of many, it is now generally accepted that the minimum relative entropy principle is a mathematical theorem [9]-[11]. It is also known as Gibbs conditioning to probabilists [12]. Consider a sequence of independent, identically distributed (i.i.d.) discrete random variables $X_1, X_2, \ldots, X_n$, with the state space $S = \{\omega_1, \omega_2, \ldots, \omega_M\}$ ($M$ could be infinite) and prior distribution $\Pr\{X_i = \omega_m\} = p_m$. (In classical mechanics, the $\omega_i$ are called microstates.) The theorem states

$$\lim_{n\to\infty} \Pr\left\{ X_1 = \omega_m \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} h(X_i) = a \right\} = p_m^*, \qquad (1)$$

where $h(\cdot)$ is a function defined on the state space $S$ and $p_m^*$ has the "canonical" form of $p_m e^{\lambda_0 + \lambda_1 h(\omega_m)}$ [13]. The constants $\lambda_0$ and $\lambda_1$ are determined according to

$$\sum_{m=1}^{M} p_m^* = 1, \qquad \sum_{m=1}^{M} h(\omega_m)\, p_m^* = a. \qquad (2)$$

The first condition enforces normalization; the second one is interpreted as "conditioned on observing the mean value for $h(\cdot)$". We shall call $p_m$ the prior. The form of $p_m^*$ is exactly the same as that derived through minimizing the relative entropy

$$H(\{p_m^*\}\,\|\,\{p_m\}) = \sum_{m=1}^{M} p_m^* \ln\frac{p_m^*}{p_m} \qquad (3)$$

under the constraints given in (2). It is important to point out that MaxEnt is really the minimum relative entropy with uniform prior.
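
The canonical form and its two constraints can be explored numerically. The following is a minimal sketch, not taken from the paper: it assumes a finite state space, a value of the observed mean strictly between the smallest and largest value of $h$, and a root for $\lambda_1$ inside a fixed bracket. The function names `tilted_distribution` and `relative_entropy`, and the six-sided die used as the state space, are illustrative choices.

```python
import math

def tilted_distribution(prior, h, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve p*_m = p_m * exp(lam0 + lam1 * h_m) with sum(p*) = 1 and mean(h) = target.

    Assumes min(h) < target_mean < max(h) and that lam1 lies in [lo, hi].
    """
    def mean_h(lam1):
        # Shift exponents by their maximum to avoid overflow.
        exps = [lam1 * x for x in h]
        top = max(exps)
        w = [p * math.exp(e - top) for p, e in zip(prior, exps)]
        return sum(wi * x for wi, x in zip(w, h)) / sum(w)

    # The tilted mean increases with lam1, so bisection finds the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_h(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam1 = 0.5 * (lo + hi)
    z = sum(p * math.exp(lam1 * x) for p, x in zip(prior, h))
    lam0 = -math.log(z)  # normalization fixes lam0 once lam1 is known
    p_star = [p * math.exp(lam0 + lam1 * x) for p, x in zip(prior, h)]
    return p_star, lam0, lam1

def relative_entropy(p_star, prior):
    """Relative entropy of p_star with respect to the prior."""
    return sum(q * math.log(q / p) for q, p in zip(p_star, prior) if q > 0)

# Illustrative state space: a die with a uniform prior and observed mean 4.5.
faces = [1, 2, 3, 4, 5, 6]
uniform = [1.0 / 6] * 6
p_star, lam0, lam1 = tilted_distribution(uniform, faces, 4.5)
```

If the target mean is set to the prior mean (3.5 for the uniform die), both multipliers come out numerically zero and the prior is returned unchanged.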

This theorem is concerned with the conditional distribution of a collection of individual samples, given that some quantity averaged over the large number of individual samples shows highly unlikely behavior. Note that if the observed sample mean is the expected value of the prior, then $\lambda_0 = \lambda_1 = 0$. This theorem is closely related to the large deviation theory of the empirical distribution (i.e., histogram)

$$L_n(\omega_m) = \frac{1}{n}\sum_{i=1}^{n} \delta_{\{X_i = \omega_m\}}, \qquad (4)$$

where the delta function $\delta_{\{X_i=\omega_m\}} = 1$ if $X_i = \omega_m$, and zero otherwise. According to the strong law of large numbers in probability theory, we know that this empirical distribution converges to the prior distribution $\{p_m\}$ of $X_i$. Moreover, the level-2 Sanov theorem in large deviation theory states that for any set $C$ of probability distributions, we have

$$\lim_{n\to\infty} \frac{1}{n}\ln \Pr\{L_n \in C\} = -\inf_{\mu\in C}\{I(\mu)\}, \qquad (5)$$

where $I(\mu) = \sum_{i=1}^{M} \mu(\omega_i)\ln[\mu(\omega_i)/p_i]$ is the relative entropy of $\mu$ with respect to the prior distribution $p_i$. Therefore, the relative entropy can be interpreted as the "free energy" of deviation in the sense of a distribution [15]. And at this juncture, the two free energies, one from the theory of large deviation and one in the theory of Markov dynamics [16], agree.

If we set $C = \{\mu : \langle h(\cdot)\rangle_\mu = a\}$ as the space of probability distributions with the given constraint, then for any given distribution $\mu$,

$$\Pr\left\{ L_n = \mu \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} h(X_i) = a \right\} = \Pr\{L_n = \mu \mid C\} \to 0, \qquad (6)$$

unless $\mu = \mu^*$, where $\mu^*$ satisfies $I(\mu^*) = \inf_{\mu\in C}\{I(\mu)\}$, i.e., has minimum relative entropy. It is implied that the empirical distribution $L_n$ is dominated by $\mu^*$ when $n \to \infty$. Furthermore, since the distributions of different $X_i$ under the constraint $\frac{1}{n}\sum_{i=1}^{n} h(X_i) = a$ are identical, the limiting distribution $\mu^*$ for $L_n$ also holds for each $X_i$.
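
The convergence of the conditioned marginal to the minimum relative entropy distribution can be checked exactly in a small case. The sketch below is an illustration under stated assumptions, not the authors' computation: it takes i.i.d. die throws with a uniform prior, conditions on the sample mean being exactly 4.5, and counts sequences with integer arithmetic. All function names are hypothetical.

```python
import math

FACES = [1, 2, 3, 4, 5, 6]

def sum_counts(n, faces=FACES):
    """counts[s] = number of length-n face sequences whose faces sum to s."""
    counts = {0: 1}
    for _ in range(n):
        nxt = {}
        for s, c in counts.items():
            for f in faces:
                nxt[s + f] = nxt.get(s + f, 0) + c
        counts = nxt
    return counts

def conditional_marginal(n, total, faces=FACES):
    """Exact Pr{X_1 = f | X_1 + ... + X_n = total} under a uniform prior."""
    rest = sum_counts(n - 1, faces)
    ways = [rest.get(total - f, 0) for f in faces]
    denom = sum(ways)
    return [w / denom for w in ways]

def tilted_uniform(mean, faces=FACES):
    """Canonical form p*_f proportional to exp(lam1 * f); lam1 found by bisection."""
    def tilted(lam1):
        w = [math.exp(lam1 * f) for f in faces]
        z = sum(w)
        return [x / z for x in w]
    lo, hi = -20.0, 20.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if sum(p * f for p, f in zip(tilted(mid), faces)) < mean:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

def max_gap(n, mean=4.5):
    """Largest pointwise gap between the exact conditional marginal and p*."""
    exact = conditional_marginal(n, round(n * mean))
    limit = tilted_uniform(mean)
    return max(abs(a - b) for a, b in zip(exact, limit))

gaps = {n: max_gap(n) for n in (20, 200)}
```

The gap shrinks as the number of samples grows, which is the content of the limit statement; the sketch says nothing about the rate for other priors or constraints.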

If one assumes a uniform prior distribution in the canonical ensemble due to ignorance, and the constraint is based on the observed mean energy $h(\cdot)$, then the posterior distribution of $h(X_i)$ is just the exponential, canonical distribution. On the other hand, Jaynes [1] argued that the entropy of statistical mechanics and the entropy in information theory are principally the same thing, and simply maximizing the entropy

$$S(\{p_m^*\}) = -\sum_{m=1}^{\infty} p_m^* \ln p_m^* \qquad (7)$$

under some constraint on the mean observations would give the correct canonical distribution. Hence such an optimizing argument is mathematically equivalent to the previous theorem, and consequently statistical mechanics could be reinterpreted as a particular application of a general theory of logical inference and information theory [17]. While Jaynes' approach to statistical mechanics, as well as the widely used minimum relative entropy principle in information theory, is based on observations of mean energy, the classic approach of Boltzmann, Gibbs and Darwin-Fowler to statistical mechanics interprets the same theorem differently.

For the canonical ensemble, suppose it is a part of a larger microcanonical ensemble consisting of $N$ closed, identical canonical ensembles. Let $X_i$ represent the microstate of the $i$-th canonical ensemble, say momenta and positions $(p, q)$. Then the high-dimensional vector $X = (X_1, X_2, \ldots, X_N)$ represents a microstate in the microcanonical ensemble. Let the function $e(X_i)$ be the energy of the $i$-th canonical ensemble, and the total energy $E_{tot} = \sum_{i=1}^{N} e(X_i)$. The law of classical mechanical energy conservation says that $X$ is confined to the subspace $\{E_{tot} = H\}$, where $H$ is the given total energy of the larger microcanonical ensemble. The notion of equal a priori probability further assumes that the probability of $X$ is equally distributed on such a subspace [18]. The marginal distribution of each $X_i$ is then exponentially dependent on $e(X_i)$ when $N$ tends to infinity. Boltzmann's most probable state method and Darwin-Fowler's steepest descent method are both based on such a setup and are mathematically equivalent [19]. Note that Boltzmann, Gibbs and Darwin-Fowler deal with the convergence of empirical distributions as in Eq. (5) rather than the marginal distribution as in the theorem (1). However, when $N$ tends to infinity, the limiting empirical distribution is the same as the limiting marginal distribution.
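
A discrete toy model makes the emergence of the exponential marginal concrete. The sketch below is an assumption-laden illustration rather than the continuous phase-space setting of the text: each subsystem is given nondegenerate energy levels $0, 1, 2, \ldots$ (in units of one quantum), all ways of sharing a fixed total among the subsystems are taken as equally probable, and the marginal of one subsystem is compared with the exponential (geometric) law of the same mean. The function names are hypothetical.

```python
from math import comb

def microcanonical_marginal(k, n_sub, total):
    """Pr{e_1 = k} when every way of sharing `total` quanta among
    n_sub >= 2 subsystems is equally likely (equal a priori probability)."""
    if k < 0 or k > total:
        return 0.0
    # Ways for the other n_sub - 1 subsystems to hold total - k quanta,
    # divided by the number of all microstates with the given total.
    return comb(total - k + n_sub - 2, n_sub - 2) / comb(total + n_sub - 1, n_sub - 1)

def canonical_limit(k, mean_energy):
    """Limiting law exp(-beta * k) / Z with beta = ln(1 + 1 / mean_energy)."""
    q = mean_energy / (1.0 + mean_energy)  # q = exp(-beta)
    return (1.0 - q) * q ** k

n_sub, mean_energy = 2000, 2
total = n_sub * mean_energy
table = [
    (k, microcanonical_marginal(k, n_sub, total), canonical_limit(k, mean_energy))
    for k in range(6)
]
```

In this toy model the only inputs are energy conservation and uniformity over the constrained set; no prior on the single subsystem is specified separately.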

The distribution for the high-dimensional microstate $X$ in Boltzmann's approach, subject to the energy conservation $E_{tot} = H$, is exactly the same as that of the MaxEnt approach conditioned on observing the mean energy $\langle e\rangle = H/N$. Hence they are mathematically equivalent. However, subtle differences exist in their interpretations: for the MaxEnt approach, one must first assume the existence of a prior distribution for the canonical ensemble even without a constraint on the mean energy. In the classic approach, the equal a priori probability of the entire microcanonical ensemble can be verified from such a uniform prior distribution of the independent subsystems without any constraint in the MaxEnt principle. That is why Jaynes called this framework "subjective thermodynamics" [1].

However, the reasoning behind using a uniform prior distribution as the most suitable one when one knows nothing about a random variable is only empirical, and one must be very careful when applying it to a specific scientific problem. For example, if one only knows the mean particle number in a grand-canonical ensemble, this principle would conclude that the particle number distribution is likewise exponential (i.e., geometric). But the experimentally observed distribution is Poissonian when the particles are nearly independent, whether distinguishable or not (see below for detailed discussion). Hence justifying only the form of the energy distribution in the canonical ensemble is not a sufficient proof of the validity of the MaxEnt principle as a substitute for classical statistical mechanics. In other words, the maximum entropy or minimum relative entropy principle, by itself, can never tell you the prior distribution. The prior distribution has to be supplied by the specific problem to which the principle is applied. Of course, for the purpose of data analysis exclusively, this technique could be quite useful in supplying a minimal model maximizing the degrees of freedom beyond the given constraints [20]. Professional statisticians would also use other methods to test the uniform prior hypothesis after the analysis of the data.
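
The difference between the two candidate number distributions shows up already in the variance. The following sketch, with hypothetical function names and an arbitrary illustrative mean of 3, compares the geometric law (the MaxEnt result for a uniform prior on $n = 0, 1, 2, \ldots$ with a fixed mean) against the Poisson law of the same mean.

```python
import math

def geometric_pmf(k, mean):
    """MaxEnt law on k = 0, 1, 2, ... for a uniform prior and a fixed mean."""
    q = mean / (1.0 + mean)
    return (1.0 - q) * q ** k

def poisson_pmf(k, mean):
    """Poisson law with the same mean, evaluated in log space to avoid overflow."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def mean_and_variance(pmf, mean, kmax=2000):
    """First two moments by direct summation, truncated at kmax."""
    m1 = sum(k * pmf(k, mean) for k in range(kmax))
    m2 = sum(k * k * pmf(k, mean) for k in range(kmax))
    return m1, m2 - m1 * m1

geo = mean_and_variance(geometric_pmf, 3.0)
poi = mean_and_variance(poisson_pmf, 3.0)
```

Both laws reproduce the constrained mean, but the geometric variance is $\langle n\rangle(1+\langle n\rangle)$ while the Poisson variance is $\langle n\rangle$, so a number-fluctuation measurement can distinguish them.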

Now the central question arises: what exactly are the prior distributions of energy and particle number for the grand-canonical ensemble? Gibbs tried to answer this question more than one hundred years ago, starting from the equal probability a priori. His derivation for energy fluctuation was highly successful, but for the grand-canonical ensemble with fluctuating particle numbers, a difficulty known as the Gibbs paradox arises: whether or not the phase space volume $\phi(E, v, n) = \int d^{3n}q\, d^{3n}p$ used in the grand-canonical ensemble should be divided by $n!$. It is now understood that for microcanonical or canonical ensembles, both with fixed particle number, the paradox is not a well-defined problem.

Similar to the derivation for canonical ensembles, we again suppose a large microcanonical ensemble with total energy, volume and particle number invariant. The box further consists of $N$ open, identical small grand-canonical ensembles, each with fixed volume $v$ and mean particle number $\langle n\rangle$. They are statistically identical but not rigorously independent. The phase-space uniformity states that the high-dimensional microstate space consists of all the $N\langle n\rangle$ particles in the large box uniformly distributed in position and momentum [18], and we ask what the distribution of particle numbers within a small grand-canonical ensemble is. Hence the natural methodology is to calculate the number of high-dimensional microstates corresponding to a given energy $E$ and particle number $n$ in a grand-canonical volume. This number of high-dimensional microstates gives the weight (probability) of such a microstate in the smaller subsystem. The relation between a small subsystem and the rest of the "reservoir" is a rather subtle issue, which has been repeatedly emphasized in statistical physics.

Textbooks [21] often proceed in the same manner as in the treatment of the canonical ensemble through Boltzmann's most probable distribution method. This is a little misleading. The key to this problem lies in how to go about reconstructing the high-dimensional microstate from the low-dimensional microstates of each subsystem. There is no problem for the canonical ensemble, since one can obtain the high-dimensional microstate simply by linking the microstates of all the subsystems together. However, in the case of distinguishable particles, we must take into account the partition of all $N\langle n\rangle$ particles into the $N$ identical subsystems for the grand canonical ensemble. Let $m_i$ be the number of grand canonical ensembles whose microstates contain $n_i$ particles with energy $E_i$. Hence, for any possible distribution $\{m_i\}$ of the microstates in grand canonical ensembles, there are two kinds of partitions: one is a partition of these occupation numbers $\{m_i\}$ into a total of $N$ subsystems; the other is the partition of all the $N\langle n\rangle$ particles (i.e., labeling particles) into the possible set $\{n_i\}$. The canonical ensemble only deals with the former, and textbooks, for grand canonical ensembles, also only consider the former, while the factorial $n!$ comes from the latter partition of particles [23].

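Hence, the number of all high-dimensional microstates corresponding to $\{m_i\}$ is given by

$$W(\{m_i\}) = \frac{N!}{\prod_i m_i!} \cdot \frac{(N\langle n\rangle)!}{\prod_i (n_i!)^{m_i}},$$

subject to the three constraints:

$$\sum_i m_i = N, \qquad \sum_i E_i m_i = E_{tot}, \qquad \sum_i n_i m_i = N\langle n\rangle,$$

which could be maximized to derive the correct statistics of grand canonical ensembles.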
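
The two-partition counting can be checked by brute force in a very small system. The sketch below rests on explicit assumptions: energy labels are ignored, a microstate is taken to be an assignment of labeled particles to labeled boxes, and the count is assumed to have the product form $W = \frac{N!}{\prod_n m_n!}\cdot\frac{P!}{\prod_n (n!)^{m_n}}$, where $m_n$ is the number of boxes holding $n$ particles and $P$ is the total particle number. The function names and the sizes (3 boxes, 4 particles) are illustrative.

```python
from collections import Counter
from itertools import product
from math import factorial

def formula_count(histogram, n_boxes, n_particles):
    """W = N!/prod(m_n!) * P!/prod((n!)^m_n) for a histogram {n: m_n}."""
    box_ways = factorial(n_boxes)            # partition of occupation numbers over boxes
    particle_ways = factorial(n_particles)   # partition of labeled particles
    for n, m in histogram.items():
        box_ways //= factorial(m)
        particle_ways //= factorial(n) ** m
    return box_ways * particle_ways

def brute_force_counts(n_boxes, n_particles):
    """Enumerate every assignment of labeled particles to labeled boxes and
    tally assignments by their occupancy histogram."""
    counts = Counter()
    for assignment in product(range(n_boxes), repeat=n_particles):
        occupancy = Counter(assignment)
        histogram = Counter(occupancy.get(b, 0) for b in range(n_boxes))
        counts[tuple(sorted(histogram.items()))] += 1
    return counts

n_boxes, n_particles = 3, 4
observed = brute_force_counts(n_boxes, n_particles)
predicted = {key: formula_count(dict(key), n_boxes, n_particles) for key in observed}
```

Dropping the second factor would leave only the partition over boxes, which is the counting appropriate to the canonical case; the enumeration shows that it undercounts once particles can move between boxes.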

For indistinguishable particles, the weight for each high-dimensional microstate in the large microcanonical ensemble is already different from the distinguishable case, and the factorial naturally arises from consideration of the phase space volume. Hence it is well known that although the $n!$ would not appear because of the partition of particles into small subsystems, it would emerge from the phase space volume in this case. Therefore, the Gibbs paradox is definitely not related to quantum mechanics, and the partition function for the grand-canonical ensemble should be written as



$$\Xi = \sum_{n} \frac{Q(E, v, n)}{n!}\, e^{-\mu n}, \qquad (8)$$

where $Q(E, v, n)$ is the partition function for the canonical ensemble.

For independent distinguishable particles, one could understand the $n!$ from another perspective. Due to the phase space uniformity assumption, the position of each particle is uniformly distributed in this large system with total volume $Nv$. Then at a certain time, the probability for each particle belonging to a specific subsystem is $1/N$. Notice that the total number of particles is $N\langle n\rangle$, hence the distribution of the particle number in this subsystem is binomial with parameters $(N\langle n\rangle, 1/N)$. When $\langle n\rangle$ is fixed and $N$ tends to infinity, it converges to a Poisson distribution with mean $\langle n\rangle$. The factorial just comes out from the expression of the Poisson distribution

$$\Pr\{n\} = \frac{\lambda^n}{n!} e^{-\lambda}, \qquad (9)$$

where $\lambda = \langle n\rangle$. This is known as Poisson statistics for a point process, which has been experimentally verified in number fluctuation measurements based on fluorescence correlation spectroscopy (FCS) [24]. Furthermore, when $N$ tends to infinity, the positions of the particles must converge to the well-known Poisson point process, and the number of particles within a certain space is just its counting process.
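
The binomial-to-Poisson limit can be made quantitative with a short numerical sketch. It is an illustration only: the total variation distance is used as the measure of closeness (a choice made here, not in the text), the sum is truncated at a finite `kmax`, and the function names are hypothetical. At least two boxes are assumed so that the success probability $1/N$ is strictly less than one.

```python
import math

def log_binomial_pmf(k, trials, p):
    """Log of the Binomial(trials, p) probability of k, for 0 < p < 1."""
    return (math.lgamma(trials + 1) - math.lgamma(k + 1) - math.lgamma(trials - k + 1)
            + k * math.log(p) + (trials - k) * math.log1p(-p))

def log_poisson_pmf(k, lam):
    """Log of the Poisson(lam) probability of k."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def total_variation(n_boxes, mean_n, kmax=200):
    """TV distance between Binomial(n_boxes * mean_n, 1 / n_boxes) and
    Poisson(mean_n), summed over k = 0..kmax (requires n_boxes >= 2)."""
    trials = n_boxes * mean_n
    p = 1.0 / n_boxes
    tv = 0.0
    for k in range(kmax + 1):
        b = math.exp(log_binomial_pmf(k, trials, p)) if k <= trials else 0.0
        tv += abs(b - math.exp(log_poisson_pmf(k, mean_n)))
    return 0.5 * tv

distances = {n: total_variation(n, 3) for n in (2, 10, 100, 1000)}
```

The distance falls roughly in proportion to $1/N$ at fixed $\langle n\rangle$, consistent with the subsystem becoming an ever smaller fraction of the large box.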

Let us now come back to the maximum entropy or minimum relative entropy principle. It is worth noticing that the phase space uniformity is of course another form of maximum entropy for the microcanonical ensemble without any additional constraint [18], but it is different from Jaynes' framework. There is even confusion regarding whether the derivation of the canonical ensemble distribution by Darwin-Fowler is an application of the maximum entropy approach. This is not the case. Although they are based on the same mathematical theorem, they are definitely different interpretations. What Darwin-Fowler did was to derive the distribution of the subsystem from the whole phase space uniformity assumption [19]. They did not mention anything like the uniform prior distribution of the subsystem. The most important element in Darwin-Fowler's interpretation is still the role of conservation of energy at the level of a whole, isolated system, the First Law of Thermodynamics. They actually justified a special version of the law of large numbers in the empirical distribution space for the canonical ensemble, and finally got the limiting distribution, which was exactly Boltzmann's most probable state [19]. We clarify a confusion regarding their terminology. The "mean" in their work is just the mean occupation number of each microstate of the subsystem, which is exactly the probability rather than the real mean of the fluctuating energy.

Jaynes' information approach to classical mechanics is a method of statistical inference based on macroscopic observables, i.e., expected values, in contrast to mainstream statistics, whose inferences are often based on samples. In both approaches, a prior in the absence of any measurement can only be subjective. In the present paper, we have shown that the Principle of Maximum Entropy cannot fully replace the classical Boltzmann-Gibbs statistical mechanics precisely because the latter builds its "prior" on (1) uniformity in Newtonian mechanical phase space, and (2) conservation of energy, number, etc. These two assumptions are fundamentally outside any logical inference approach. The case in point is the grand canonical ensemble: mechanical phase space uniformity necessarily leads to a Poisson distribution as the prior for the number distribution of independent classical particles in an infinitesimal open box.
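
As a brief worked check of the role of the prior, consider the minimum relative entropy form $p_n^* = p_n e^{\lambda_0 + \lambda_1 n}$ applied to the particle number, with $h(n) = n$ and an observed mean $a$ (the symbol $a$ and the side-by-side comparison are introduced here for illustration). Under these assumptions a Poisson prior stays Poisson with the new mean, whereas a uniform prior on $n = 0, 1, 2, \ldots$ gives a geometric law:

```latex
\begin{align*}
p_n &= \frac{\lambda^n}{n!} e^{-\lambda}, \qquad
p_n^* = p_n\, e^{\lambda_0 + \lambda_1 n}
      = \frac{(\lambda e^{\lambda_1})^n}{n!}\, e^{\lambda_0 - \lambda}, \\
\sum_{n=0}^{\infty} p_n^* = 1
  &\;\Longrightarrow\; \lambda_0 = \lambda - \lambda e^{\lambda_1}, \qquad
\sum_{n=0}^{\infty} n\, p_n^* = a
  \;\Longrightarrow\; \lambda e^{\lambda_1} = a, \\
p_n^* &= \frac{a^n}{n!} e^{-a}
  \quad \text{(Poisson prior)}, \qquad
p_n^* = \frac{1}{1+a}\left(\frac{a}{1+a}\right)^{n}
  \quad \text{(uniform prior)}.
\end{align*}
```

The constraint alone fixes only the mean; which family the answer belongs to is decided entirely by the prior.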

L. Szilard and B. Mandelbrot also advanced another line of interpretation for classical thermodynamics, called purely phenomenological theory, based on the theory of sufficient statistics [2, 3]. Interestingly, it is also based on the above-mentioned mathematical theorem, and it asserts that all the macroscopic thermodynamic quantities are exactly the sufficient statistics of their microscopic fluctuations. The theory gives the correct distribution when a given ensemble has been perturbed but the new system still obeys the conservation law. It implies that all of the distributions in statistical mechanics must belong to the exponential family of probability distributions [25].

In the present study, we clarified E.T. Jaynes' MaxEnt approach to statistical thermodynamics based on information theory, and its relation to classical statistical mechanics. It is found that correctly determining a prior distribution is the central issue, which cannot be addressed in general from information theory or statistical inference alone. Of course, as a mathematical theory, the theorem of minimum relative entropy could be applied everywhere and need not be confined to mechanics or physics. It justifies the diverse use of "statistical mechanics", and explains why it works as a fundamental tool in information theory. More importantly, the mathematical theorem also tells us that the concepts of entropy and relative entropy are both mathematical constructions, both of which naturally arise in the asymptotic probability of large deviations [14].

It is arguable that information theory, at least in its mathematical presentation, is a statistical theory endowed with the concept of entropy. This perspective naturally resolves a nagging issue that troubles "information" as a more general theoretic concept: the relation between information and knowledge [26]. It is well understood that thermodynamics is about what is impossible (for macroscopic systems) and what is very unlikely. It provides constraints on molecular processes, but it cannot specify their mechanisms. Knowledge is ultimately in the mechanism. There seems to be a contradistinction between "statistics" and "knowledge."

There is another, dynamic origin of the concepts of entropy and relative entropy (or free energy). It has been shown recently that they are emergent properties of any Markovian process [16]. Shannon's original information theory for coding, however, is a static one.

We thank Ken Dill, Jin Feng, Michael Fisher, Chris Jarzynski, Steve Presse, Jin Wang and Ziqing Zhao for stimulating discussions. H. Ge acknowledges support by NSFC 10901040 and the Specialized Research Fund for the Doctoral Program of Higher Education (New Teachers) 20090071120003. H. Ge thanks Prof. X.S. Xie and members of his group for hospitality and support.



[1] E.T. Jaynes, Phys. Rev. 106, 620 (1957); Phys. Rev. 
108, 171 (1957). 

[2] B.B. Mandelbrot, Ann. Math. Stat. 33, 1021 (1962). 

[3] L. Szilard, Zeit. Physik 32, 753 (1925); English translation in The Collected Works of Leo Szilard: Scientific Papers, B.T. Feld and G.W. Szilard eds., MIT Press, Cambridge, MA (1972) pp. 70-102.

[4] L. Boltzmann, Lectures on Gas Theory, translated by S.G. Brush (UC Press, Berkeley, 1964).

[5] J.W. Gibbs, The Scientific Papers of J. Willard Gibbs (Dover, New York, 1961).

[6] A.L. Yuille, Neural Comput. 2, 1 (1990); P.D. Simic, Network: Comp. Neur. Sys. 1, 89 (1990).

[7] T.M. Cover and J.A. Thomas, Elements of Information Theory (John Wiley & Sons, New York, 1991).

[8] A.I. Khinchin, Mathematical Foundations of Information Theory (Dover, New York, 1957); A. Hobson, J. Stat. Phys. 1, 383 (1969).

[9] J.E. Shore and R.W. Johnson, IEEE Tr. Info. Th. IT-26, 26 (1980).

[10] J.M. van Campenhout and T.M. Cover, IEEE Tr. Info. 
Th. IT-27, 483 (1981). 

[11] J.R. Banavar, A. Maritan and I. Volkov, J. Phys.: Con- 
dens. Matter 22, 063101 (2010) 

[12] Gibbs conditioning in the theory of probability started with Khinchin's [13] and O. Lanford's work, which had been taken up by Stroock and Zeitouni, and became a part of modern theories of large deviation. For a brief historical account, see A. Dembo and O. Zeitouni, Large Deviation Techniques and Applications, 2nd ed. (Springer-Verlag, New York, 1998); J. Feng, Lecture Notes (unpublished) (2011).

[13] A.I. Khinchin, Mathematical Foundations of Statistical 
Mechanics (Dover, New York, 1960). 

[14] H. Touchette, Phys. Rep. 478, 1-69 (2009). 

[15] H. Qian, Phys. Rev. E. 63, 042103 (2001); P.G. 
Bergmann and J.L. Lebowitz, Phys. Rev. 99, 578 (1955). 

[16] H. Ge, Phys. Rev. E. 80, 021137 (2009); H. Ge and H. Qian, Phys. Rev. E. 81, 051133 (2010); M. Esposito and C. van den Broeck, Phys. Rev. Lett. 104, 090601 (2010); M. Santillan and H. Qian, Phys. Rev. E. 83, 041130 (2011).

[17] A. Ben-Naim, A Farewell to Entropy: Statistical Ther- 
modynamics based on Information (World Scientific, Sin- 
gapore, 2008). 

[18] The rigorous definition of equal a priori probability and entropy of a microcanonical ensemble for continuous cases is riddled with mathematical subtleties. We refer the reader to Khinchin's book [13] for details.

[19] P.L. Ponczek and C.C. Yan, Revista Brasileira de Fisica 
6, 471 (1976). 

[20] E. Schneidman, M.J. Berry, R. Segev and W. Bialek, Nature 440, 1007 (2006).

[21] D.A. McQuarrie, Statistical Mechanics (Viva Books Private Limited, New Delhi, 2005).

[22] J.R. Ray, Eur. J. Phys. 5, 219 (1984); R. Becker, Theory of Heat, 2nd ed. (Springer, New York, 1967).

[23] The history and different approaches to resolving the Gibbs paradox are controversial. A more detailed account of the subject will be published elsewhere.

[24] D. Magde, E.L. Elson and W.W. Webb, Phys. Rev. Lett. 29, 705 (1972); H. Qian, Biophys. Chem. 38, 49 (1990).

[25] H. Jeffreys, Proc. Cambridge Philos. Soc. 54, 393 (1960).

[26] J. Gleick, The Information: A History, a Theory, a Flood (Pantheon Books, New York, 2011); G. Nunberg, NYT Rev. Book March 20, 10 (2011).



